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^ : 1 Introduction 

^ ' Neural networks learn from examples. This concept has extensively been studied 

using models and methods of statistical physics ^ • In particular the following 
scenario has been investigated: Feedforward networks are trained on examples 
generated by a different network. 
l/^ ' Feedforward networks classify high dimensional data, in the simplest case by 

, a single output bit (1/0, wrong/correct, yes/no). They are adaptive algorithms, 

\l ' their parameters (= synaptic weights) are adapting to a set of training examples, 

, in our case a set of input /output pairs. After the training phase, the networks 

' have achieved some knowledge about the rule which has generated the examples, 

the network can classify input vectors which it never has seen before, it can 
d ' generalise. 

Several mathematical models studied before use training examples which are 
generated by a different neural network, called the "teacher" . On-line training 
' ^ , means that the "student", at each training step, receives a new example from 

the teacher network. Each example is used only once for training. Hence, in this 
Q , case training may be considered as dynamics of interacting neural networks: A 

teacher network is sending signals (= examples) to the student network which 
is stepwise changing its weights according to the received message. 
^ ■ Mathematical methods have been developed to calculate the properties of 

the dynamics of interacting networks. In the limit of large networks one can de- 
scribe the system by a differential equation for a few "order parameters" , which 
determine, for example, the generalisation error as a function of the number of 
training examples ||^. 

In this contribution we give an overview over recent work on the theory of 
interacting neural networks. The model is defined in Section 2. The typical 
teacher/student scenario is considered in Section 3. A static teacher network 



> 



1 




Figure 1: Different kinds of interaction discussed in this contribution: (a) static 
teaclier vs. adaptive student, (b) time series generation and prediction, (c) 
self-interaction, (d) mutual learning 

is presenting training examples for an adaptive student network. In the case 
of multilayer networks, the student shows a transition from a symmetric state 
to specialisation. Neural networks can also generate a time series. Training 
on time series and predicting it are studied in Section 4. When a network is 
trained on its own output, it is interacting with itself. Such a scenario has 
implications on the theory of prediction algorithms, as discussed in Section 
5. When a system of networks is trained on its minority decisions, it may be 
considered as a model for competition in closed markets, see Section 6. In 
Section 7 we consider two mutually interacting networks. A novel phenomenon 
is observed: synchronisation by mutual learning. In Section 8 it is shown, how 
this phenomenon can be applied to cryptography: Generation of a secret key 
over a public channel. 

2 On-line training 

The simplest mathematical neural network is the perceptron. It consists of a 
single layer of N synaptic weights w = {wi, wjv). For a given input vector x, 
the output bit is given by 
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cr = sign WiXij (1) 

The decision surface of the perceptron is just a hyperplanc in the A^-dimensional 
input space, w-x — 0. A perceptron may also have a continuous output y as 



y = tanh WiXij (2) 

A perceptron may be considered as an elementary unit of a more complex 
network like an attractor network or a multilayer network. In fact any function 
can be approximated by a multilayer network if the number of hidden units is 
large enough 

The perceptron can learn from examples. Examples are input/output pairs, 

ix{t),a{t)) t^l,...,aN (3) 

On-line training means that at each time step t the weights of the perceptron 
adapt to a new example, for instance by the rule 

w{t + 1) = w{t) + j^a{t)x{t) F{a{t)x{t) ■ w{t)) (4) 

F(z) = 1 is usually called - after the corresponding biological mechanism 
- the Hebbian rule, each synapse Wi responds to the activities a(t)xi(t) at its 
ends. F{z) = <d{—z) is called the Rosenblatt rule: a training step occurs only 
if the example is misclassified. Finally, the Adatron rule F{z) — |2:|0(— z) 
is important since it gives good results for generalisation, as discussed in the 
following. For the last two learning rules, in addition to the two neural activities 
at the synaptic ends, the postsynaptic potential determines the strength of the 
synaptic adaptation. 



3 Generalisation 

Now we consider two perceptrons. One is called the teacher network which is 
producing a set of examples. It is receiving a set of random input vectors x{t) 
and generating output bits <j(i). The teacher has a fixed weight vector ur^ . 

Each time the teacher is producing a new example, the student perceptron 
is trained on it according to (|4|). As a consequence, the weight vector of the 
student w^{t) is time dependent. The student tries to approach the teacher, at 
each training step t its weight vector moves towards the one of the teacher. It is 
easy to see that for random inputs x, equation (^) gives a kind of random walk 
in A^-dimensional space with a bias towards the teacher vector ur^ . 

The distance between student and teacher can be measured by the overlap 

T ^ 

\wT\\wS\ ^ ' 
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The quantity R determines the angle (j) between student and teacher weights, 
R = cos (j)- It turns out that from this overlap the generalisation error eg can 
be calculated. The generalisation error is the probability that the student gives 
an answer to a random input x which is different from the one of the teacher. 
One finds 



arccos R 



(6) 



In the limit of infinitely many input units, N oo, the dynamics of the 
overlap R(t) can be calculated analytically. According to (^, the size of the 
training step scales down with Therefore one defines a variable a = t/N 

which becomes a continuous variable, called time, in the limit of large N. 

The time dependence of i?(a) is obtained by multiplying (|^) by nr^ and 
and by averaging these two equations over the random input vector x. This can 
be done since the expressions w ■ x are Gaussian variables. 

For the Adatron rule one finally obtains the differential equation Q 

dR R „ 1 i?^ 



--- arccos i?+ -(1 - ^) - (7) 
da 2tt TT 2 

The generalisation error calculated by this equation is shown in Fig. |^. If 
only a finite number of examples has been learned, a = 0, the error is 50 %, as 
bad as random guessing. If the number of training examples is of the order of 
N, a > 0, the student network has obtained some overlap to the teacher. In the 
limit of large values of a the error decreases to zero, the student has obtained 
complete knowledge about the parameters of the teacher. 
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Figure 2: Generalisation error as a function of time, from Ref. 
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Figure 3: Generalisation error as a function of time, for a multilayer network 
with two hidden units, from Ref. M 



that the error cannot decay faster than eg cx 0.88a 
rule, one finds e„ oc a~^/^, and for the Hebbian rule e„ 



The asymptotic decay of the generalisation error depends on the learning 
rule. One finds for the Adatron rule eg cx 3/2 a~^. In fact, it has been shown 

For the Rosenblatt 
c But in ah 

cases the student succeeds to approach the teacher when it makes of the order 
of N steps. This even holds when the examples are distorted by noise 

Learning from examples works for more complex networks, too. Here I 
would like to mention the work on specialisation of committee machines [Q. 
Such a network is a multilayer network with several hidden units, similar to 
Fig. H The output bits of the continuous hidden units are summed and taken 
as the output of the network. Teacher and student networks have an identical 
architecture, and the learning step is just a gradient descent of the training 
error, the quadratic deviation between teacher and student output. 

The corresponding generalisation error is shown in Fig. ^. For small number 
of examples it decreases fast, then it reaches a plateau and only for a huge 
number of examples it decreases to zero. 

The motion of the student network can be expressed by the overlap between 
the corresponding members of the two machines. The teacher as well as the 
student consists of two weights vectors, w^^^ and w^^^ ■ A distance between 
teacher and student can be defined from the overlaps 



(8) 



Initially, all vectors are random, hence up to fluctuations all the overlaps are 
zero. Then all overlaps increase due to learning. But on the plateau of Fig. |^ 
the overlaps are all identical, = i?2,2 = ^1.2 • The student has achieved 
some knowledge, but it is in a symmetric state. Only if the student receives 
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much more information it can specialise: Ri i = i?2,2 are much larger than 

Rl,2- 

In the initial process the two members of the student committee act as being 
one single perceptron, but later they specialise and follow their partners in the 
teacher committee until they achieve complete knowledge for a ^ oo. 



4 Time series prediction and generation 

Neural networks are successful prediction algorithms Given a sequence of 
numbers, a neural network can be trained on this sequence by moving it over 
the sequence, as shown in Fig.^. This sequence can be produced by another 
network, called the teacher, by generating a new number and using it as an input 
component in the next step. Hence we have a new kind of interacting networks: 
The teacher with its static weight vector is a bit or sequence generator ||] . The 
student is adapting its weights to the sequence generated by the teacher. 




Figure 4: Time series generation and prediction by a perceptron 



In principle this is the scenario described in the previous section. Here 
the difference is that the patterns are correlated, the input is not random but 
generated from the output of the teacher. 

If a neural network cannot generate a given sequence of numbers, it cannot 
predict it with zero error. Hence one has investigated the generation of time 
series by neural networks 1 11 , 1^, ^ |l^ But this is not the whole story. Even if 
the sequence has been generated by an (unknown) neural network (the teacher), 
a different network (the student) can try to learn and to predict this sequence. 
In this context we are interested in two questions: 

1. When a student network with the identical architecture as the teacher's 
is trained on the sequence, how does the overlap between student and 
teacher develop with the number of training examples (= windows of the 
sequence)? 

2. After the student network has been trained on a part of the sequence, how 
well can it predict the sequence several steps ahead? 

Recently these questions have been investigated numerically for the simple 
perceptron, equation (|l|,||) jl^. Consider a teacher perceptron with weight 
vector vr'" generating the sequence Sq, Si, S2, St, ■■■ This sequence follows the 
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equation 

where f{x) is the transfer function. It has been shown that a perceptron can 
generate simple as weh as complex sequences |l4| . 

If f{x) is monotonic, for instance /(x) = tanh(/3a;), then in general one 
obtains quasiperiodic sequences. In fact, the sequence is essentially generated 
by one Fourier component of the weight vector wf . If the transfer function, 
however, is not monotonic, for instance f{x) = sin(/3a;), then the sequence can 
be chaotic, depending on the model parameters p^ ]. For both cases, learning 
and prediction have been investigated . 

If a quasi periodic sequence is learned on-line, using gradient descent to 
update the weights, 

N 

Aw, ^j^iSt- f(h)) ■ f{h) ■ St-, with h = (10) 

then one has found two time scales (time a means the number of training steps 
divided by N): 

1. A short scale on which the overlap R{a) between teacher and student 
rapidly increases to a value which is still far away from the value R — I, 
which corresponds to perfect agreement. 

2. A long one on which the overlap R{a) increases very slowly. Numerical 
simulations up to lO^N training steps yielded an overlap which was close 
but sell different from the value R — 1. 

Although there is a mathematical theorem on stochastic optimisation which 
seems to guarantee convergence to perfect success [ p^ , the on-line algorithm 
cannot gain much information about the teacher network, at least during rea- 
sonable training periods. 

This is completely different for a chaotic time series generated by a corre- 
sponding teacher network with /(x) = sin(a;). It turns out that the chaotic 
series appears like a random one: After a number of training steps of the order 
of N the overlap relaxes exponentially fast to perfect agreement between teacher 
and student. 

Hence, after training the perceptron with a number of examples of the order 
of N we obtain the two cases: For a quasi periodic sequence the student has not 
obtained much information about the teacher, while for a chaotic sequence the 
student's weight vector comes close to the one of the teacher. One important 
question remains: How well can the student predict the time series? 

Fig.|| shows the prediction error as a function of the time interval over which 
the student makes the predictions. The student network which has been trained 
on the quasi periodic sequence can predict it very well. The error increases 
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Figure 5: Prediction error as a function of time steps ahead, for a quasi periodic 
(lower) and chaotic (upper) series, from Ref. pO| 

Hnearly with the size of the interval, even predicting lOA^ steps ahead yields an 
error of about 10% of the total possible range. On the other side, the student 
trained on the chaotic sequence cannot make predictions. The prediction error 
increases exponentially with time; already after a few steps the error corresponds 
to random guessing, ~ 1. The explanation is that an infinitesimal change in 
the parameters of a chaotic map has the same effect as a small change in initial 
conditions, namely, an exponential growth in the distance between original and 
the disturbed trajectory. 

In summary one finds the counterintuitive result: 

1. A network trained on a quasiperiodic sequence does not obtain much in- 
formation about the teacher network which generated the sequence. But 
the network can predict this sequence over many (of the order of N) steps 
ahead. 

2. A network trained on a chaotic sequence obtains almost complete knowl- 
edge about the teacher network. But this network cannot make reasonable 
long-term predictions on the sequence. 

It would be interesting to find out whether this result also holds for other 
prediction algorithms, such as multi-layer networks. 

5 Self-interaction 

In the previous section the time series was generated by a static teacher net- 
work. Now we consider a network which changes its synaptic weights while it is 
generating a bit sequence. The teacher is interacting with itself. The motivation 
of this investigation stems from the following problem: 
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Consider some arbitrary prediction algorithm. It may contain all the knowl- 
edge of mankind, many experts may have developed it. Now there is a bit se- 
quence Si, 82, ■ ■ ■ and the algorithm has been trained on the first t bits Si, . . . , St- 
Can it predict the next bit Is the prediction error, averaged over a large 

t interval, less than 50%? 

If the bit sequence is random then every algorithm will give a prediction 
error of 50%. But if there are some correlations in the sequence then a clever 
algorithm should be able to reduce this error. In fact, for the most powerful 
algorithm one is tempted to say that for any sequence it should perform better 
than 50% error. However, this is not true To see this just generate a 

sequence Si, S2, S3, . . . using the following algorithm: 



Define St+i to be the opposite of the prediction of this algorithm 
which has been trained on Si, . . . , St ■ 



Now, if the same algorithm is trained on this sequence, it will always predict 
the following bit with 100% error. Hence there is no general prediction machine; 
to be successful the algorithm needs some pre-knowledge about the class of 
problems it is applied to. 

The Boolean perceptron is a very simple prediction algorithm for a bit se- 
quence, in particular with the Hebbian on-line training algorithm (^. What 
does the bit sequence look like for which the perceptron fails completely? 

Following (0), we just have to take the negative value 



St = -sign V WjSt-j (11) 



and then train the network on this new bit: 

Aw, = +^St St-,. (12) 

The perceptron is trained on the opposite (= negative) of its own predic- 
tion. Starting from (say) random initial states , . . . , Sn and weights w, 
this procedure generates a sequence of bits Si, S2, ■ ■ ■ St, ■ ■ ■ and of vectors 
Wi ■ ■ ■ • ■ • as well. Given this sequence and the same initial state, 

the perceptron which is trained on it yields a prediction error of 100%. 

It turns out that this simple algorithm produces a rather complex bit se- 
quence which comes close to a random one ||l7|] . After a transient time the 
weight vector w{t) seems to perform a kind of random walk on an iV-dimensional 
hyper-sphere. The bit sequence runs to a cycle whose average length L scales 
exponentially with N, 

L~2.2^. (13) 

The autocorrelation function of the sequence shows complex properties: It is 
close to zero up to N , oscillates between N and 3A^ and it is similar to random 
noise for larger distances. Its entropy is smaller than the one of a random 
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sequence since the frequency of some patterns is suppressed. Of course, it is 
not random since the prediction error is 100% instead of 50% for a random bit 
sequence. 

When a second perceptron (=student) with different initial state is 
trained on such a anti-predictable sequence generated by Eq.(^l|) it can per- 
form somewhat better than the teacher: The prediction error goes down to 
about 78% but it is still larger than 50% for random guessing. Related to this, 
the student obtains knowledge about the teacher: The angle between the two 
weight vectors relaxes to about 45 degrees 0. Hence the complex anti- 
predictable sequence still contains enough information for the student to follow 
the time dependent teacher. 



6 Agents competing in a closed market 

We just considered a network interacting with itself. Now we extend this model 
to a system of many networks interacting with the minority decision of all 
members. This work was motivated by the following problem of econophysics 

Recently a mathematical model of economy receives a lot of attention in the 
community of statistical physics. It is a simple model of a closed market: There 
are K agents who have to make a binary decision a{t) G {+1, —1} at each time 
step. All of the agents who belong to the minority gain one point, the majority 
has to pay one point (to a cashier which always wins). The global loss is given 

by 

K 



G 



t=i 



(14) 



If the agents come to an agreement before they make a new decision, it is easy 
to minimise G : [K — 1) /2 agents have to choose -|-1, then G — 1. However, this 
is not the rule of the game; the agents are not allowed to make contracts, and 
communicate only through the global sum of decisions. Each agent knows only 
the history of the minority decision, Si, S2, S3, . . ., but otherwise he/she has no 
information. Can the agent find an algorithm to maximise his/her profit? 

If each agent makes a random decision, then (G^) = K. It is possible, but 
not trivial, to find algorithms which perform better than random pof . 

Here we use a perceptron for each agent to make a decision based on the past 
N steps 5 = {St~N, St-^i) of the minority decision. The decision of agent 
w{t) is given by 

a{t) ^ sign(w(t) S). (15) 

After the bit St of the minority has been determined, each perceptron is trained 
on this new example {S_, St), 

^wit) StS. (16) 
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This problem could be solved analytically . The average global loss for 77 ^ 
is given by 

(G2) = (1 - 2/it)K ~ 0.363 K. (17) 

Hence, for small enough learning rates the system of interacting neural networks 
performs better than random decisions. Successful cooperation emerges in a 
pool of adaptive perceptrons. 

7 Synchronisation by mutual learning 

Before, we have considered a pool of several neural networks interacting through 
their minority decisions. Now we study the interaction of just two neural net- 
works Contrary to the teacher/student case, now both of the networks are 
adaptive, each network is learning the output bit of its partner. 



X 




Figure 6: Two perceptrons are trained by their mutual output bits, from Ref. 




In particular, we consider the model where two perceptrons A and B receive 
a common random input vector x and change their weights w according to their 
mutual bit a, as sketched in Fig. ^. The output bit a of a single perceptron is 
given by the equation 

a = sign(w • x) (18) 

x_ is an A^-dimensional input vector with components which are drawn from a 
Gaussian with mean and variance 1. u; is a A'^-dimensional weight vector with 
continuous components which are normalised, 

w-w^\ (19) 
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The initial state is a random choice of the components wf^^^i — 1, N for 
the two weight vectors a-nd w^. At each training step a common random 
input vector is presented to the two networks which generate two output bits 

and according to Now the weight vectors are updated by the 

Rosenblatt learning rule (Q): 

w^{t + l) = w^{t) + ^xa^ ei~a^a^) (20) 

G(x) is the step function. Hence, only if the two perceptrons disagree a training 
step is performed with a learning rate 77. After each step (|2^), the two weight 
vectors have to be normalised. 

In the limit N 00, the overlap 

Rit)=w^{t)w^{t) (21) 

has been calculated analytically |Q . The number of training steps t is scaled 
as a = t/N, and R{a) follows the equation 

where (j) is the angle between the two weight vectors vr^ and , i.e. R = cos (f>. 
This equation has fixed points R = 1, R = —1, and 

V _l^cos0 ^23) 



V27r (p 

Fig.0 shows the attractive fixed point of (|2^) as a function of the learning 
rate rj. For small values of 77 the two networks relax to a state of a mutual 
agreement, R —f 1 for 77 — > 0. With increasing learning rate rj the angle between 
the two weight vectors increases up to the value (j) — 133° for 

T]^ric = 1.816 (24) 

Above the critical rate r]c the networks relax to a state of complete disagreement, 
(f) — 180°, i? = —1. The two weight vectors are antiparallel to each other, 

A R 
W — ~W . 

As a consequence, the analytic solution shows, well supported by numerical 
simulations for N = 100, that two neural networks can synchronise to each 
other by mutual learning. Both of the networks are trained to the examples 
generated by their partner and finally obtain an antiparallel alignment. Even 
after synchronisation the networks keep moving, the motion is a kind of random 
walk on an N-dimensional hypersphere producing a rather complex bit sequence 
of output bits p^ . In fact, after synchronisation the system is 

identical to the single network learning its opposite output bit discussed in 
section 5. 
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Figure 7: Final overlap R between two perceptrons as a function of learning 
rate rj. Above a critical rate r]c the time dependent networks are synchronised. 
From Ref. ^ 

8 Cryptography 

In the field of cryptography, one is interested in methods to transmit secret 
messages between two partners A and B. An opponent E who is able to listen 
to the communication should not be able to recover the secret message. 

Before 1976, all cryptographic methods had to rely on secret keys for en- 
cryption which were transmitted between A and B over a secret channel not 
accessible to any opponent. Such a common secret key can be used, for ex- 
ample, as a seed for a random bit generator by which the bit sequence of the 
message is added (modulo 2). 

In 1976, however, Diffie and Hellmann found that a common secret key could 
be created over a public channel accessible to any opponent. This method is 
based on number theory: Given limited computer power, it is not possible to 
calculate the discrete logarithm of sufficiently large numbers [ |2l[ |. 

Recently, it has been shown how interacting neural networks can produce a 
common secret k ey by exchanging bits over a public channel and by learning 
from each other . 

We want to apply synchronisation of neural networks to cryptography. In 
the previous section we have seen that the weight vectors of two perceptrons 
learning from each other can synchronise. The new idea is to use the common 
weights ur^ — —wP as a key for encryption. But two problems have to be solved 
yet: (i) Can an external observer, recording the exchange of bits, calculate the 
final w_'^{t)^ (ii) does this phenomenon exist for discrete weights? Point (i) 
is essential for cryptography, it will be discussed further below. Point (ii) is 
important for practical solutions since communication is usually based on bit 
sequences. It will be investigated in the following. 
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Synchronisation occurs for normalised weights, unnormahsed ones do not 
synchronise pG]. Therefore, for discrete weights, we introduce a restriction in 

— A / R 

the space of possible vectors and limit the components ' to 2L + 1 different 
values, 

wf^^ e {-L,-L+1,...,L-1,L} (25) 

In order to obtain synchronisation to a parallel ~ instead of an antiparallel - 
state u/^ — , we modify the learning rule (po| ) to: 



w^{t+l) = w^{t)~xa^e{a^a^) (26) 

Now the components of the random input vector x are binary Xi G {+1,-1}. If 
the two networks produce an identical output bit = , then their weights 
move one step in the direction of —xia^. But the weights should remain in 
the interval (|2^), therefore if any component moves out of this interval, it is set 
back to the boundary Wi = ±L. 

Each component of the weight vectors performs a kind of random walk 
with reflecting boundary. Two corresponding components wf and w!j^ receive 
the same random number ±1. After each hit at the boundary the distance 
\wf — wf\ is reduced until it has reached zero. For two perceptrons with a 
A^-dimensional weight space we have two ensembles of N random walks on the 
internal {—L,...,L}. If we neglect the global signal = as well as the 
bias cr^, we expect that after some characteristic time scale t = O(L^) the 
probability of two random walks being in different states decreases as 

P{t) - P(0)e-*/^ (27) 

Hence the total synchronisation time should be given by N ■ P{t) ~ 1 which 
gives 



Uync-TlnN (28) 

In fact, the simulations for N = 100 show that two perceptrons with L = 3 
synchronise in about 100 time steps and the synchronisation time increases 
logarithmically with N. However, the simulations also showed that an opponent, 
recording the sequence of {cr^,(T^ ,x)t is able to synchronise, too. Therefore, a 
single perceptron does not allow a generation of a secret key. 

Obviously, a single perceptron transmits too much information. An oppo- 
nent, who knows the set of input/output pairs, can derive the weights of the two 
partners after synchronisation. Therefore, one has to hide so much information, 
that the opponent cannot calculate the weights, but on the other side one has 
to transmit enough information that the two partners can synchronise. 

In fact, it was shown that multilayer networks with hidden units may be 
candidates for such a task |22|. More precisely, we consider parity machines 
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Figure 8: Parity machine with three hidden units. 

with three hidden units as shown in Fig.^. Each hidden unit is a perceptron 
with discrete weights (^) . The output bit r of the total network is the product 
of the three bits of the hidden units 



„A A A 
(Ti CTn (7, 



= (Ti (Jq ao 



(29) 



At each training step the two machines A and B receive identical input vectors 
Xi,X2,x^. The training algorithm is the following: Only if the two output bits 
are identical, = r^, the weights can be changed. In this case, only the 
hidden unit Ci which is identical to r changes its weights using the Hebbian rule 



(30) 



For example, if = = 1 there are four possible configurations of the hidden 
units in each network: 

(+1,+1,+1),(+1,-1,-1),(-1,+1,-1),(-1,-1,+1) 
In the first case, all three weight vectors Wj, w_2,Wji are changed, in all other three 
cases only one weight vector is changed. The partner as well as any opponent 
does not know which one of the weight vectors is updated. 

The partners A and B react to their mutual stop and move signals and 
r^, whereas an opponent can only receive these signals but not influence the 
partners with its own output bit. This is the essential mechanism which allows 
synchronisation but prohibits learning. Numerical |Q as well as analytical |Q 
calculations of the dynamic process show that the partners can synchronise in 
a short time whereas an opponent needs a much longer time to lock into the 
partners. 



This observation holds for an observer who uses the same algorithm (30) as 
the two partners A and B. Note that the observer knows 1. the algorithm of A 
and B, 2. the input vectors x_itX2,x^ at each time step and 3. the output bits 

and at each time step. Nevertheless, it does not succeed in synchronising 
with A and B within the communication period. 

Since for each run the two partners draw random initial weights and since the 
input vectors are random, one obtains a distribution of synchronisation times 
as shown in Fig. ^ for iV = 100 and L = 3. The mean value of this distribution 
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Figure 9: Distribution of synchronisation time for N = 100, L — 3, from Ref.[^ 
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Figure 10: Average synchronisation time as a function of inverse system size. 
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Figure 11: Distribution of the ratio of synchronisation time between networks 
A and B to the learning time of an attacker E. 



is shown as a function of system size N in Fig. |l^. Even an infinitely large 
network needs only a finite number of exchanged bits - about 400 in this case - 
to synchronise. 

If the communication continues after synchronisation, an opponent has a 
chance to lock into the moving weights of A and B. Fig. 11 shows the distribution 
of the ratio between the synchronisation time of A and B and the learning time 
of the opponent. In the simulations for N = 100, this ratio never exceeded the 
value r = 0.1, and the average learning time is about 50000 time steps, much 
larger than the synchronisation time. Hence, the two partners can take their 



weights wf{t) 



(t) at a time step t where synchronisation most probably 



occurred as a common secret key. Synchronisation of neural networks can be 
used as a key exchange protocol over a public channel. 

Up to now it is not clear, yet, whether more advanced attacks will finally 
break this exchange protocol. On the other side, there are several possible 
extensions of the synchronisation mechanism where tracking seems to be even 
harder ||2l. 



9 Conclusions 

The dynamics of interacting neural networks has been studied in the context 
of a simple model: the perceptron and its extensions. The dynamics of these 
models can be calculated analytically; macroscopic properties can be described 
by differential equations for a few order parameters. 

Several kinds of interaction processes have been studied. In all cases the 
networks are trained by a set of examples which are generated by different 
networks. This is the way networks interact. Some networks are generating pairs 



17 



of high dimensional input data and a corresponding output signal and transmit 
this information to other networks. The networks receiving this information 
are adapting their parameters - their synaptic weights - to each example. One 
question is to what extend the networks exchanging information are approaching 
each other in the high dimensional synaptic space. 

The teacher /student scenario of a static networks generating the examples 
in addition to an adaptive network being trained on these examples is the case 
studied most. The question is: How well does the student learn the rule which 
is producing the examples, how can it generalise? The analytic solutions for 
the perccptron show the number of training examples has to be of the number 
of neurons to achieve generalisation. For the optimal training algorithm the 
generalisation error decays not faster than the inverse power of training time. 

If both teacher and student networks arc more complex then new phenomena 
are observed. For example, for a committee machine specialisation occurs: For 
short times the student relaxes fast to a configuration for which it has achieved 
generalisation but it is still in a symmetric state, it behaves like a simple pcr- 
ceptron. Only if the number of examples is increased to a very large value the 
student network can escape from this configuration and each member of the 
committee specialises to its corresponding member in the teacher network. 

A static teacher network can also generate a time series on which the student 
network is trained. In addition to the overlap between teacher and student 
parameters, one is interested in the prediction abilities of the student after the 
interaction period. It turns out that learning and prediction are not necessarily 
correlated. One finds perceptrons which learn a chaotic sequence very well but 
cannot predict it. On the other side, for quasipcriodic sequences it is difficult 
to learn the weights of the teacher but it is simple to predict the corresponding 
sequence. 

A very general question about properties of prediction algorithms leads to 
a perceptron interacting with itself. It produces a rather complex time series 
which yields 100 % prediction error if the same perceptron is trained on this 
sequence. But even if a different perceptron is trained on this sequence, it 
achieves some knowledge about the teacher, hence its prediction error is larger 
than 50 %. 

A community of neural networks can exchange information and learn from 
each other. It is shown how such a scenario can lead to successful cooperation 
in the minority game - a model for competing agents in a closed market. 

If two networks are exchanging information and learning from each other, 
they can synchronise. That means, after some training time they relax to a 
configuration with identical time dependent synaptic weights (up to a common 
sign). The two networks keep diffusing on a high dimensional hypcrsphere, but 
with identical weights. Neural networks can synchronise by mutual learning. 

This new phenomenon is applied to cryptography. It is shown how multi- 
layer networks with discrete weights synchronise after a few hundred steps of 
interactions. However, a third network which is recording the exchange of ex- 
amples does not synchronise, at least during a short period of time. Learning by 
mutual learning is fast, but learning by listening is very slow. Hence two part- 
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ncrs can agree on a common secret key over a public channel. Any observer who 
knows all the details of the algorithm and who knows all the training examples 
cannot calculate the secret key. This is - to my knowledge - the first public key 
exchange which is not based on number theory. Future research will show how 
well this new algorithm - based on mutual learning of neural networks - can 
resist any advanced attacks which still have to be invented. 

Synchronisation of neural networks is an active subject of research in neu- 
robiology. Up to now it is - to my knowledge - unclear how synchronisation 
develops and what is its function. Here the model calculations point to a new 
direction: Two or several biological networks can achieve a common time de- 
pendent state by learning information exchanged between active partners. Any 
other network receiving the same information without being able to influence 
the partners cannot lock into this time dependent common state. Hence even 
in a fully connected network, parts of it can synchronise by mutual learning, at 
the same time screening their synchronised state from parts which learn only 
by listening. 

In summary, this is the first attempt to develop a theory of interacting neu- 
ral networks. Several phenomena were discovered from simple models like the 
perceptron or the parity machines. These phenomena were neither included 
into the models from the beginning nor are obvious; they are a result of coop- 
erative behaviour of the synaptic weights and can only be understood from the 
analytical and numerical calculations. 

As the different scenarios described in this overview show, the first results 
of this theory of interacting adaptive systems may be relevant in the fields 
of cooperative systems, nonlinear dynamics, time series prediction, economic 
models, biological networks and cryptography. 
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